Simultaneous Speech Recognition Based on Automatic Missing Feature Mask Generation by Integrating Sound Source Separation
Abstract
Our goal is to realize a humanoid robot capable of recognizing simultaneous speech. A humanoid robot in a real-world environment usually hears a mixture of sounds, so three capabilities are essential for robot audition: sound source localization, sound source separation, and recognition of the separated sounds. The interface between sound source separation and speech recognition is particularly important. In this paper, we designed such an interface by applying Missing Feature Theory (MFT). In this method, spectral sub-bands distorted by sound source separation are detected in the input speech as missing features. The detected missing features are masked during recognition so that they do not degrade the result. The method is therefore more flexible when noise changes dynamically and drastically. The most important issue is how to detect the distorted spectral sub-bands. To address it, we used a speech feature appropriate for MFT-based automatic speech recognition (ASR) and developed automatic missing feature mask generation. As the speech feature, we used a Mel-Scale Log Spectral (MSLS) feature instead of the Mel-Frequency Cepstral Coefficients (MFCC) commonly used for ASR. The missing feature mask is generated automatically from information provided by sound source separation. To evaluate the method, we implemented it on the humanoid robot SIG2 and performed experiments on the recognition of three simultaneous isolated words. Our method outperformed conventional ASR with the MSLS feature.
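The masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the leak-energy ratio criterion, the threshold value, and the single diagonal Gaussian standing in for an acoustic model are all assumptions made for illustration. It only shows how a binary mask can exclude distorted sub-bands from a likelihood computation.

```python
import math


def hard_mask(separated_energy, leak_energy, threshold=0.5):
    """Return 1 for sub-bands judged reliable and 0 for those judged distorted.

    Assumption (illustrative only): a band is treated as distorted when its
    estimated leak energy exceeds `threshold` times the separated energy.
    """
    return [1 if leak <= threshold * sep else 0
            for sep, leak in zip(separated_energy, leak_energy)]


def masked_log_likelihood(feature, mask, mean, var):
    """Diagonal-Gaussian log-likelihood summed over reliable bands only.

    Bands with mask 0 contribute nothing, so a distorted band cannot
    dominate the score.
    """
    total = 0.0
    for x, m, mu, v in zip(feature, mask, mean, var):
        if m:
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
    return total


# Toy data: the second band is heavily contaminated by leakage.
separated = [4.0, 1.0, 3.0]
leak = [0.5, 0.9, 0.2]
feature = [0.0, 10.0, 1.0]   # hypothetical log-spectral feature vector
mean = [0.0, 0.0, 1.0]       # hypothetical model mean
var = [1.0, 1.0, 1.0]        # hypothetical model variance

mask = hard_mask(separated, leak)
score_masked = masked_log_likelihood(feature, mask, mean, var)
score_unmasked = masked_log_likelihood(feature, [1, 1, 1], mean, var)
```

In this toy case the mask drops the second band, so the masked score is much higher than the unmasked one, which is penalized by the outlier value in that band. A log-spectral feature suits this scheme because each feature dimension corresponds to one frequency band, whereas a cepstral transform spreads a distorted band across all coefficients.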
Similar resources
Leak energy based missing feature mask generation for ICA and GSS and its evaluation with simultaneous speech recognition
This paper addresses automatic speech recognition (ASR) for robots integrated with sound source separation (SSS) by using leak noise based missing feature mask generation. The missing feature theory (MFT) is a promising approach to improve noise-robustness of ASR. An issue in MFT-based ASR is automatic generation of the missing feature mask. To improve robot audition, we applied this theory to ...
Improving speech recognition of two simultaneous speech signals by integrating ICA BSS and automatic missing feature mask generation
Robot audition systems require capabilities for sound source separation and the recognition of separated sounds, since we hear a mixture of sounds in our daily lives, especially mixtures of speech. We report a robot audition system with a pair of omni-directional microphones embedded in a humanoid that recognizes two simultaneous talkers. It first separates the sound sources by Independent Compone...
Design and Implementation of Robot Audition System 'HARK' - Open Source Software for Listening to Three Simultaneous Speakers
This paper presents the design and implementation of the HARK robot audition software system consisting of sound source localization modules, sound source separation modules and automatic speech recognition modules of separated speech signals that works on any robot with any microphone configuration. Since a robot with ears may be deployed to various auditory environments, the robot audition sy...
Soft missing-feature mask generation for simultaneous speech recognition system in robots
This paper addresses automatic soft missing-feature mask (MFM) generation based on leak energy estimation for a simultaneous speech recognition system. An MFM is used as a weight for probability calculation in the recognition process. In previous work, a threshold-based zero-or-one function was applied to decide whether a spectral parameter is reliable for each frequency bin. The function ...
Time-Frequency Masking: Linking Blind Source Separation and Robust Speech Recognition
In order to deploy automatic speech recognition (ASR) effectively in real world scenarios it is necessary to handle hostile environments with multiple speech and noise sources. One classical example is the so-called "cocktail party problem" (Cherry, 1953), where a number of people are talking simultaneously in a room and the ASR task is to recognize the speech content of one or more target spea...
Publication date: 2006